A Web Site Mining Algorithm Using the Multiscale Tree Representation Model

Authors

  • YongHong Tian
  • TieJun Huang
  • Wen
Abstract

Web site mining, which aims at automatically discovering and classifying topic-specific web sites from the World Wide Web, has attracted increasing attention with the exponential growth in both the amount and the diversity of web information. This paper describes a novel multiscale approach to web site mining that represents a web site as a multiscale site tree, extending existing tree representation models of web sites by an extra level of resolution (Document Object Model, or DOM, nodes). The hidden Markov tree (HMT) is used to model the intrascale contextual dependencies in the multiscale site tree, and a context-based fusion algorithm combines the interscale context models with the HMT-based classifiers in order to refine the raw classification results. To further improve classification accuracy while reducing classification overhead, we introduce a two-stage text-based denoising procedure that removes the “noise” information within web sites, and an entropy-based approach that dynamically prunes the site trees. Experiments show that our approach achieves, on average, a 16% improvement in classification accuracy and a 34.5% reduction in processing time over the baseline system.
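
The abstract does not spell out how the entropy-based pruning works, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes each node of a site tree carries a class-probability estimate (the hypothetical field `posterior`) and that the subtree below a node is dropped once that estimate is confident enough, i.e. its Shannon entropy falls at or below a hypothetical `threshold`. The names `Node`, `entropy`, `prune`, and `count` are all illustrative.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a site tree (site, page, or DOM level); illustrative only."""
    name: str
    posterior: List[float]  # assumed class probabilities for this node
    children: List["Node"] = field(default_factory=list)


def entropy(p: List[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def prune(node: Node, threshold: float) -> Node:
    """Return a pruned copy of the tree rooted at `node`.

    Assumption: a node whose posterior entropy is <= threshold is already
    classified confidently, so its descendants are not examined further.
    """
    if entropy(node.posterior) <= threshold:
        return Node(node.name, node.posterior, [])
    kept = [prune(child, threshold) for child in node.children]
    return Node(node.name, node.posterior, kept)


def count(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(count(child) for child in node.children)


# Toy site tree with made-up probabilities for two classes.
site = Node("site", [0.5, 0.5], [
    Node("page_a", [0.99, 0.01], [
        Node("dom_a1", [0.7, 0.3]),
        Node("dom_a2", [0.6, 0.4]),
    ]),
    Node("page_b", [0.6, 0.4], [
        Node("dom_b1", [0.9, 0.1]),
    ]),
])

pruned = prune(site, threshold=0.5)
print(count(site), count(pruned))
```

In this toy tree, `page_a` is confident enough that its two DOM children are skipped, while `page_b` is still uncertain and keeps its child. A lower threshold prunes less and examines more nodes; how the paper actually selects or adapts the criterion is not stated in the abstract.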

Similar resources

A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...

Classification of Web Log Data to Identify Interested Users Using Naïve Bayesian Classification

Web Usage Mining (WUM) is the process of extracting knowledge from Web user’s access data by exploiting Data Mining technologies. It can be used for different purposes such as personalization, system improvement and site modification. Study of interested web users, provides valuable information for web designer to quickly respond to their individual needs. The main objective of this paper is to...

Web Categorisation Using Distance-Based Decision Trees

In Web classification, web pages are assigned to pre-defined categories mainly according to their content (content mining). However, the structure of the web site might provide extra information about their category (structure mining). Traditionally, both approaches have been applied separately, or are dealt with techniques that do not generate a model, such as Bayesian techniques. Unfortunatel...

Ensemble of M5 Model Tree Based Modelling of Sodium Adsorption Ratio

This work reports the results of four ensemble approaches with the M5 model tree as the base regression model to anticipate Sodium Adsorption Ratio (SAR). Ensemble methods that combine the output of multiple regression models have been found to be more accurate than any of the individual models making up the ensemble. In this study additive boosting, bagging, rotation forest and random subspace...

A Technique for Improving Web Mining using Enhanced Genetic Algorithm

World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines used conventional methods to retrieve information on the Web; however, the search results of these engines are still able to be refined and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...

Publication date: 2003